Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

Sequencing and Raw Sequence Data Quality Control

1.3 SEQUENCING DEPTH AND READ QUALITY

1.3.1 Sequencing Depth

The biological results and interpretation of sequencing data for the different sequencing
applications are greatly affected by the number of sequenced reads that cover the genomic
regions. Usually, multiple reads overlap at any given region of the genome. The
sequencing depth (called coverage in Equations 1.1 and 1.2) measures the average read
abundance: it is the total number of bases in all sequenced short reads that match a
genome divided by the length of that genome, provided the genome size is known. If the
reads are equal in length, the sequencing depth is calculated as

\[
\text{Coverage} = \frac{\text{read length (bp)} \times \text{number of reads}}{\text{genome size (bp)}} \tag{1.1}
\]
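As a quick check of Equation 1.1, the calculation can be sketched in a few lines of Python. The function name `coverage_equal_reads` and the run parameters (10 million reads of 150 bp against a 5 Mbp genome) are hypothetical values chosen only for illustration.

```python
def coverage_equal_reads(read_length_bp: int, number_of_reads: int,
                         genome_size_bp: int) -> float:
    """Average sequencing depth per Equation 1.1 (all reads share one length)."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return read_length_bp * number_of_reads / genome_size_bp


# Hypothetical run: 10 million reads of 150 bp against a 5 Mbp genome.
depth = coverage_equal_reads(read_length_bp=150,
                             number_of_reads=10_000_000,
                             genome_size_bp=5_000_000)
print(f"{depth:.0f}X")  # 300X
```

The result is an average over the whole genome; individual regions may be covered more or less deeply than this figure suggests.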

If the reads are not equal in length, the coverage is calculated as

\[
\text{Coverage} = \frac{\sum_{i=1}^{n} \text{length of read}_i \text{ (bp)}}{\text{genome size (bp)}} \tag{1.2}
\]

where n is the number of sequenced reads.
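When read lengths vary, for example after adapter or quality trimming, the sum in Equation 1.2 can be accumulated directly from a FASTQ file. The sketch below is a minimal illustration, not a production parser: it assumes the common four-line FASTQ record layout with unwrapped sequence lines, and the function name `coverage_from_fastq`, the three toy reads, and the 10 bp genome size are all made up for the example.

```python
import io


def coverage_from_fastq(handle, genome_size_bp: int) -> float:
    """Average sequencing depth per Equation 1.2 (reads may differ in length).

    Assumes four lines per record: header, sequence, '+', qualities.
    """
    total_bases = 0
    for line_number, line in enumerate(handle):
        if line_number % 4 == 1:  # second line of each record is the sequence
            total_bases += len(line.strip())
    return total_bases / genome_size_bp


# Toy input: three reads of 10, 6, and 4 bp (20 bases in total).
fastq_text = (
    "@read1\nACGTACGTAC\n+\nIIIIIIIIII\n"
    "@read2\nACGTAC\n+\nIIIIII\n"
    "@read3\nACGT\n+\nIIII\n"
)
depth = coverage_from_fastq(io.StringIO(fastq_text), genome_size_bp=10)
print(f"{depth:.1f}X")  # 2.0X
```

For a real file, the `io.StringIO` object would be replaced by an open file handle. Note that the equations count bases of reads that match the genome, so counting every read in the raw file, as done here, gives an upper estimate.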

The sequencing coverage is expressed as the number of times the genome is covered
(e.g., 1X, 2X, 20X, etc.).

The sequencing depth affects the completeness of genomic assembly, the accuracy of de novo
and reference-guided assembly, the number of detected genes, gene expression levels in
RNA-Seq, variant calling, genotyping in whole genome sequencing, microbial identification
and diversity analysis in metagenomics, and identification of protein–DNA interactions in
epigenetics. Therefore, it is important to investigate sequencing depth before sequence
analysis. In general, the more times each base is sequenced, the more reliable the
resulting data.
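Because depth matters for so many downstream analyses, it is often useful to rearrange Equation 1.1 and estimate how many reads a target depth requires. The following sketch does this under the same equal-read-length assumption. The function name `reads_needed`, the 30X target, and the 3 Gbp genome size are illustrative choices, not recommendations from the text.

```python
import math


def reads_needed(target_coverage: float, genome_size_bp: int,
                 read_length_bp: int) -> int:
    """Rearranged Equation 1.1: reads = coverage * genome size / read length.

    Rounded up, since a fraction of a read cannot be sequenced.
    """
    if read_length_bp <= 0:
        raise ValueError("read_length_bp must be positive")
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


# Illustrative values: 30X depth over a 3 Gbp genome with 150 bp reads.
print(reads_needed(target_coverage=30,
                   genome_size_bp=3_000_000_000,
                   read_length_bp=150))  # 600000000
```

The estimate counts individual reads and assumes every read matches the genome, so a real experiment would typically need somewhat more to reach the same depth.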

1.3.2 Base Call Quality

We have already discussed the different sequencing technologies, which use different
sequencing approaches. In the end, however, each of these technologies attempts to infer
the order of nucleotides in the nucleic acid studied. The process of inferring the base
(A, C, G, or T) at a specific position of the sequenced DNA fragment during the sequencing
process is called base calling. The sequencing platforms are not perfect, and errors may
occur when the machine tries to infer a base from each measured signal. For all platforms,
the strength of the signals and other characteristic features are measured and interpreted
by the base caller software. Errors affect the sequence data directly and make them less
reliable. Therefore, it is critical to know the probability of such errors so that users can
assess the quality of their sequence data and decide how to deal with
those errors. Most platforms are equipped with base calling programs that assign